Academic interest: building a classification model to predict individuals who experienced 90-days-past-due delinquency or worse. The data provides 10 variables, all appropriate for predicting Probability of Default (PD).
Dependent variable:

- SeriousDlqin2yrs (y): person experienced 90 days past due delinquency or worse (Y/N)

Independent variables:

- RevolvingUtilizationOfUnsecuredLines (x1): total balance on credit cards and personal lines of credit (excluding real estate and installment debt like car loans) divided by the sum of credit limits (percentage)
- age (x2): age of borrower in years (integer)
- NumberOfTime30-59DaysPastDueNotWorse (x3): number of times the borrower has been 30-59 days past due but no worse in the last 2 years (integer)
- DebtRatio (x4): monthly debt payments, alimony, and living costs divided by monthly gross income (percentage)
- MonthlyIncome (x5): monthly income (real)
- NumberOfOpenCreditLinesAndLoans (x6): number of open loans (installment, like a car loan or mortgage) and lines of credit (e.g. credit cards) (integer)
- NumberOfTimes90DaysLate (x7): number of times the borrower has been 90 days or more past due (integer)
- NumberRealEstateLoansOrLines (x8): number of mortgage and real estate loans, including home equity lines of credit (integer)
- NumberOfTime60-89DaysPastDueNotWorse (x9): number of times the borrower has been 60-89 days past due but no worse in the last 2 years (integer)
- NumberOfDependents (x10): number of dependents in family, excluding themselves (spouse, children etc.) (integer)
# Load train/test, flag each row's source, and stack for combined profiling
train <- read.csv("cs-training.csv", header = TRUE, sep = ",", row.names = 1)
test <- read.csv("cs-test.csv", header = TRUE, sep = ",", row.names = 1)
train$flg <- "Y"
test$flg <- "N"
data <- rbind(train, test)
# Shorter names: y for the target, x1..x10 for the predictors
cols <- c('y','x1','x2','x3','x4','x5','x6','x7','x8','x9','x10','flg')
names(data) <- cols
names(train) <- cols
names(test) <- cols
dim(data)
## [1] 251503 12
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.03 0.15 6.05 0.56 50710.00
The data show that anything with a ratio > 1 has a very high density of bads, and as the ratio approaches 1 from below the proportion of bads grows. Based on this, everything with ratio > 1 will be treated as a special case; the rest will be converted into decile ranges.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 41.0 52.0 52.3 63.0 109.0
The age range is quite large (0 to 109), and a minimum age of 0 looks suspect, but no clean-up is planned for now.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0000 0.0000 0.0000 0.2436 0.0000 13.0000
There appear to be about 269 records with a value of 98 in the training data set [plotted as -1 for better clarity on the plot and summary]. This may be a case of an existing back-end field being reused for some other purpose. This is an integer field and can be converted into an ordered factor before modelling.
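A minimal sketch of that conversion (the -1 recode and ordered-factor step are one possible treatment, not the final one; the toy vector stands in for train$x3):

```r
# Sketch only: park the 98 codes at -1 (as done for the plot) and
# convert x3 to an ordered factor; toy values stand in for train$x3
x3 <- c(0, 1, 2, 98, 3, 98)
x3clean <- ifelse(x3 >= 90, -1, x3)
x3fac <- factor(x3clean, levels = sort(unique(x3clean)), ordered = TRUE)
levels(x3fac)   # "-1" "0" "1" "2" "3"
```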
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.2 0.4 353.0 0.9 329700.0
##
## Not Null Null
## In 113036 1827
## Out 7233 27904
Usually debt-to-income ratios should be smaller than 1; in general, depending on the type of asset product and the economic stress cycle, lenders operate within a 0.35 to 0.45 debt-to-income band. The table above (which appears to cross ratios inside/outside [0, 1] against monthly income being null) suggests the extreme ratios largely coincide with a missing income. Since monthly income is also available in the data (x5), we will decide what to do with this field after reviewing that variable.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 3400 5400 6670 8249 3009000 29731
The spread of income is quite vast. Of course this is a continuous variable, and there is really no limit to what one can earn, but it might make sense to cap this variable and look at it again. There are really two issues: 1. outliers beyond the Q3 + 1.5*IQR range and below the Q1 - 1.5*IQR range; 2. missing values — the NAs will have to be treated or dropped. This needs to be fixed before x4.
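A hedged sketch of both fixes (Tukey-fence capping plus median imputation is one option, not necessarily the one ultimately adopted; the toy vector stands in for train$x5):

```r
# Sketch only: cap at Q1 - 1.5*IQR / Q3 + 1.5*IQR, then impute NAs
# with the median; toy values stand in for train$x5
x5 <- c(3400, 5400, NA, 8249, 3009000)
q <- quantile(x5, c(0.25, 0.75), na.rm = TRUE)
fence <- c(q[1] - 1.5 * diff(q), q[2] + 1.5 * diff(q))
x5cap <- pmin(pmax(x5, fence[1]), fence[2])
x5cap[is.na(x5cap)] <- median(x5cap, na.rm = TRUE)
```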
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 5.000 8.000 8.453 11.000 58.000
There is a certain right-hand skew in the data; 58 lines of loans and credit is a bit too much. We may convert this into quantiles to handle the outliers.
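That binning could look like the following (a sketch using gtools::quantcut, which the analysis loads later; quartiles here for brevity, toy values in place of train$x6):

```r
# Sketch only: quantile bins absorb the long right tail of x6
library(gtools)
x6 <- c(0, 2, 5, 8, 11, 58)
x6bin <- quantcut(x6, q = seq(0, 1, by = 0.25))
table(x6bin)   # the outlier 58 falls into the same top bin as 11
```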
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.0000 0.0000 0.0000 0.0885 0.0000 17.0000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.266 0.000 98.000
Yet again there lies an outlier value of 98. This is perhaps a code, and we shall have to decide how to treat these values. One option would be to convert the integer values into an ordered factor and assign 98 to an 'others' level. The maximum value otherwise is 17, in itself a very high value.
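One way to sketch that option (toy values stand in for train$x7; putting 'others' as the top level is an assumption):

```r
# Sketch only: send the 98 code to an 'others' level, order the rest
x7 <- c(0, 0, 1, 17, 98)
x7lab <- ifelse(x7 >= 90, 'others', as.character(x7))
x7fac <- factor(x7lab, levels = c(sort(unique(x7[x7 < 90])), 'others'),
                ordered = TRUE)
levels(x7fac)   # "0" "1" "17" "others"
```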
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 1.018 2.000 54.000
## 75%
## 5
This is long-tailed. Arguably, as the number of loans goes up, so does the possibility of default.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.2404 0.0000 98.0000
There looks to be another incidence of the 98 code. Flag it off and convert the rest into deciles.

## x10: NumberOfDependents
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 0.757 1.000 20.000 3924
Culling some of the outliers to see if some simple relationships exist.
## y x1 x2 x3
## Min. :0.00000 Min. : 0.00 Min. : 0.0 Min. : 0.0000
## 1st Qu.:0.00000 1st Qu.: 0.03 1st Qu.: 41.0 1st Qu.: 0.0000
## Median :0.00000 Median : 0.15 Median : 52.0 Median : 0.0000
## Mean :0.06684 Mean : 6.05 Mean : 52.3 Mean : 0.4211
## 3rd Qu.:0.00000 3rd Qu.: 0.56 3rd Qu.: 63.0 3rd Qu.: 0.0000
## Max. :1.00000 Max. :50708.00 Max. :109.0 Max. :98.0000
##
## x4 x5 x6 x7
## Min. : 0.0 Min. : 0 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.2 1st Qu.: 3400 1st Qu.: 5.000 1st Qu.: 0.000
## Median : 0.4 Median : 5400 Median : 8.000 Median : 0.000
## Mean : 353.0 Mean : 6670 Mean : 8.453 Mean : 0.266
## 3rd Qu.: 0.9 3rd Qu.: 8249 3rd Qu.:11.000 3rd Qu.: 0.000
## Max. :329664.0 Max. :3008750 Max. :58.000 Max. :98.000
## NA's :29731
## x8 x9 x10 flg
## Min. : 0.000 Min. : 0.0000 Min. : 0.000 Length:150000
## 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.000 Class :character
## Median : 1.000 Median : 0.0000 Median : 0.000 Mode :character
## Mean : 1.018 Mean : 0.2404 Mean : 0.757
## 3rd Qu.: 2.000 3rd Qu.: 0.0000 3rd Qu.: 1.000
## Max. :54.000 Max. :98.0000 Max. :20.000
## NA's :3924
## [1] 269
## [1] 63725
## [1] 52558
## [1] 97442
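The culling code itself is not echoed; a plausible reconstruction, assuming the filters implied by the counts above and the post-cull summary below (98-coded delinquencies dropped, utilisation ratios above 1 removed, income trimmed to its 5th-95th percentile band — the author's exact filter may differ, e.g. age also appears trimmed):

```r
# Assumed reconstruction, not the author's exact code: drop coded
# delinquency values, utilisation ratios above 1, and income outliers;
# relies on `train` loaded and renamed earlier in this notebook
inc <- quantile(train$x5, c(0.05, 0.95), na.rm = TRUE)
keep <- with(train, x1 <= 1 & x3 < 90 & x7 < 90 & x9 < 90 &
                    !is.na(x5) & x5 >= inc[1] & x5 <= inc[2])
summary(train[keep, ])
```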
## y x1 x2 x3
## Min. :0.00000 Min. :0.00000 Min. :29.00 Min. : 0.0000
## 1st Qu.:0.00000 1st Qu.:0.03602 1st Qu.:42.00 1st Qu.: 0.0000
## Median :0.00000 Median :0.17225 Median :51.00 Median : 0.0000
## Mean :0.06268 Mean :0.31423 Mean :51.37 Mean : 0.2535
## 3rd Qu.:0.00000 3rd Qu.:0.53553 3rd Qu.:61.00 3rd Qu.: 0.0000
## Max. :1.00000 Max. :1.00000 Max. :78.00 Max. :13.0000
## x4 x5 x6 x7
## Min. : 0.0000 Min. : 1300 Min. : 0.000 Min. : 0.00000
## 1st Qu.: 0.1634 1st Qu.: 3789 1st Qu.: 5.000 1st Qu.: 0.00000
## Median : 0.3100 Median : 5592 Median : 8.000 Median : 0.00000
## Mean : 0.3779 Mean : 6133 Mean : 9.006 Mean : 0.08107
## 3rd Qu.: 0.4817 3rd Qu.: 8047 3rd Qu.:12.000 3rd Qu.: 0.00000
## Max. :95.3009 Max. :14587 Max. :58.000 Max. :17.00000
## x8 x9 x10 flg
## Min. : 0.000 Min. : 0.00000 Min. : 0.0000 Length:97442
## 1st Qu.: 0.000 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 1.000 Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 1.098 Mean : 0.06082 Mean : 0.8916
## 3rd Qu.: 2.000 3rd Qu.: 0.00000 3rd Qu.: 2.0000
## Max. :54.000 Max. :11.00000 Max. :20.0000
## Warning: Using size for a discrete variable is not advised.
# x1: flag utilisation ratios > 1 as a special case; decile the rest
library(gtools)
train$x1lab <- ''
train$x1lab[train$x1 > 1] <- 'Ratio > 1'
ok <- train$x1 <= 1
train$x1lab[ok] <- as.character(quantcut(train$x1[ok], seq(0, 1, by = 0.1)))

# x5: cap monthly income at its 5th and 95th percentiles (the capping
# sketched in the earlier commented-out ifelse, written with pmin/pmax)
lo <- quantile(train$x5, 0.05, na.rm = TRUE)
hi <- quantile(train$x5, 0.95, na.rm = TRUE)
train$x5cap <- pmin(pmax(train$x5, lo), hi)